Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction
Internal identifier: 000312 (Main/Exploration); previous: 000311; next: 000313
Authors: Roger Sayle [United Kingdom]; Paul Hongxing Xie [Sweden]; Sorel Muresan [Sweden]
Source:
- Journal of chemical information and modeling [ 1549-9596 ] ; 2012.
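French descriptors
- Pascal (Inist): Fouille donnée, Texte, Bioinformatique, Ontologie, Reconnaissance caractère, Reconnaissance optique caractère, Rupture, Brevet, Propriété industrielle, Dictionnaire automatique, Industrie pharmaceutique, Nomenclature, Gène, Homme, Typographie, Erreur humaine, Correction automatique, Trait union
- Wicri: Brevet, Propriété industrielle, Industrie pharmaceutique, Nomenclature, Homme
English descriptors
- KwdEn: Automatic correction, Automatic dictionary, Bioinformatics, Character recognition, Data mining, Gene, Human, Human error, Hyphen, Nomenclature, Ontology, Optical character recognition, Patent rights, Patents, Pharmaceutical industry, Rupture, Text, Typography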
Abstract
The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.
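The abstract does not spell out how CaffeineFix works, so the following is only a minimal illustrative sketch of the general idea of an unbounded dictionary combined with spelling correction. It is not the authors' implementation. It assumes, purely for illustration, that a well-formed name is any concatenation of fragments from a tiny hypothetical vocabulary (`FRAGMENTS`), and it uses plain Levenshtein distance with a hypothetical `max_edits` cut-off. The function names `levenshtein` and `correct` are likewise invented for this example.

```python
from functools import lru_cache

# Hypothetical toy vocabulary: any concatenation of these fragments counts as
# a well-formed name, so the set of accepted names is unbounded even though
# this list is short. A real grammar of chemical nomenclature is far richer.
FRAGMENTS = ("meth", "eth", "prop", "but", "an", "ol", "oic",
             "amine", "chloro", "benz", "ene", "yl")


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def correct(token: str, max_edits: int = 2):
    """Return (closest well-formed name, edits), or None if too far away."""
    n = len(token)
    # best[j] holds (cost, corrected text) for rewriting token[:j] as a
    # sequence of fragments; best[0] is the empty prefix at zero cost.
    best = [(0, "")] + [None] * n
    for j in range(1, n + 1):
        for i in range(j):
            if best[i] is None:
                continue
            cost_i, text_i = best[i]
            for frag in FRAGMENTS:
                # Align the slice token[i:j] with one fragment.
                cost = cost_i + levenshtein(token[i:j], frag)
                if best[j] is None or cost < best[j][0]:
                    best[j] = (cost, text_i + frag)
    if best[n] is None or best[n][0] > max_edits:
        return None
    cost, text = best[n]
    return text, cost


if __name__ == "__main__":
    # "I" for "l" is a typical OCR confusion.
    print(correct("methanoI"))
    print(correct("chIorobenzene"))
    print(correct("qqqq"))
```

In this sketch a token that is already well formed comes back unchanged with zero edits, a token within `max_edits` of some well-formed name comes back corrected, and anything further away is rejected. In a pipeline such as the one the abstract describes, the corrected string would then be handed to whichever name-to-structure tool is in use.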
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000102
- to stream PascalFrancis, to step Curation: 000670
- to stream PascalFrancis, to step Checkpoint: 000074
- to stream Main, to step Merge: 000315
- to stream Main, to step Curation: 000312
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction</title>
<author><name sortKey="Sayle, Roger" sort="Sayle, Roger" uniqKey="Sayle R" first="Roger" last="Sayle">Roger Sayle</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>NextMove Software</s1>
<s2>Cambridge</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>NextMove Software</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Hongxing Xie, Paul" sort="Hongxing Xie, Paul" uniqKey="Hongxing Xie P" first="Paul" last="Hongxing Xie">Paul Hongxing Xie</name>
<affiliation wicri:level="1"><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Suède</country>
<wicri:noRegion>431 83 Mölndal</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Muresan, Sorel" sort="Muresan, Sorel" uniqKey="Muresan S" first="Sorel" last="Muresan">Sorel Muresan</name>
<affiliation wicri:level="1"><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Suède</country>
<wicri:noRegion>431 83 Mölndal</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">12-0102438</idno>
<date when="2012">2012</date>
<idno type="stanalyst">PASCAL 12-0102438 INIST</idno>
<idno type="RBID">Pascal:12-0102438</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000102</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000670</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000074</idno>
<idno type="wicri:doubleKey">1549-9596:2012:Sayle R:improved:chemical:text</idno>
<idno type="wicri:Area/Main/Merge">000315</idno>
<idno type="wicri:Area/Main/Curation">000312</idno>
<idno type="wicri:Area/Main/Exploration">000312</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction</title>
<author><name sortKey="Sayle, Roger" sort="Sayle, Roger" uniqKey="Sayle R" first="Roger" last="Sayle">Roger Sayle</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>NextMove Software</s1>
<s2>Cambridge</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>NextMove Software</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Hongxing Xie, Paul" sort="Hongxing Xie, Paul" uniqKey="Hongxing Xie P" first="Paul" last="Hongxing Xie">Paul Hongxing Xie</name>
<affiliation wicri:level="1"><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Suède</country>
<wicri:noRegion>431 83 Mölndal</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Muresan, Sorel" sort="Muresan, Sorel" uniqKey="Muresan S" first="Sorel" last="Muresan">Sorel Muresan</name>
<affiliation wicri:level="1"><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Suède</country>
<wicri:noRegion>431 83 Mölndal</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Journal of chemical information and modeling</title>
<title level="j" type="abbreviated">J. chem. inf. model. </title>
<idno type="ISSN">1549-9596</idno>
<imprint><date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Journal of chemical information and modeling</title>
<title level="j" type="abbreviated">J. chem. inf. model. </title>
<idno type="ISSN">1549-9596</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic correction</term>
<term>Automatic dictionary</term>
<term>Bioinformatics</term>
<term>Character recognition</term>
<term>Data mining</term>
<term>Gene</term>
<term>Human</term>
<term>Human error</term>
<term>Hyphen</term>
<term>Nomenclature</term>
<term>Ontology</term>
<term>Optical character recognition</term>
<term>Patent rights</term>
<term>Patents</term>
<term>Pharmaceutical industry</term>
<term>Rupture</term>
<term>Text</term>
<term>Typography</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Fouille donnée</term>
<term>Texte</term>
<term>Bioinformatique</term>
<term>Ontologie</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Rupture</term>
<term>Brevet</term>
<term>Propriété industrielle</term>
<term>Dictionnaire automatique</term>
<term>Industrie pharmaceutique</term>
<term>Nomenclature</term>
<term>Gène</term>
<term>Homme</term>
<term>Typographie</term>
<term>Erreur humaine</term>
<term>Correction automatique</term>
<term>Trait union</term>
<term>.</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Brevet</term>
<term>Propriété industrielle</term>
<term>Industrie pharmaceutique</term>
<term>Nomenclature</term>
<term>Homme</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.</div>
</front>
</TEI>
<affiliations><list><country><li>Royaume-Uni</li>
<li>Suède</li>
</country>
</list>
<tree><country name="Royaume-Uni"><noRegion><name sortKey="Sayle, Roger" sort="Sayle, Roger" uniqKey="Sayle R" first="Roger" last="Sayle">Roger Sayle</name>
</noRegion>
</country>
<country name="Suède"><noRegion><name sortKey="Hongxing Xie, Paul" sort="Hongxing Xie, Paul" uniqKey="Hongxing Xie P" first="Paul" last="Hongxing Xie">Paul Hongxing Xie</name>
</noRegion>
<name sortKey="Muresan, Sorel" sort="Muresan, Sorel" uniqKey="Muresan S" first="Sorel" last="Muresan">Sorel Muresan</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000312 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000312 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:12-0102438 |texte= Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction }}
This area was generated with Dilib version V0.6.32.